
Conversation

chaxu01 (Collaborator) commented Aug 27, 2025

This patch fixes an assertion failure in the latest whisper.cpp when KleidiAI is enabled:

ggml/src/ggml-backend.cpp:1088: GGML_ASSERT(*cur_backend_id != -1) failed

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Aug 27, 2025
@chaxu01 chaxu01 requested a review from ggerganov August 28, 2025 07:26
Comment on lines 90 to 98
static inline bool is_whisper_model(const struct ggml_tensor * op) {
    // decoder embedding sizes of the known Whisper variants
    // (tiny, base, small, medium, large)
    const int64_t n_dims = op->src[0]->ne[0];
    return n_dims == 384  || n_dims == 512  || n_dims == 768 ||
           n_dims == 1024 || n_dims == 1280;
}

ggerganov (Member)
This seems very hacky - what is the reason to add this check?

chaxu01 (Collaborator, Author) commented Aug 28, 2025

This check is added to ensure the KleidiAI backend only handles GET_ROWS operations for Whisper models, not for LLaMA models, because of Whisper's specific usage pattern: the decoder feeds GET_ROWS into MUL_MAT, which the KleidiAI kernels can accelerate (see the sketch after this list). The dimensions correspond to the known Whisper model sizes:
- 384: Whisper tiny
- 512: Whisper base
- 768: Whisper small
- 1024: Whisper medium
- 1280: Whisper large
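
For illustration, the decoder pattern in question looks roughly like this; the tensor and variable names (model.tok_emb, tokens, cur) are hypothetical, not the actual whisper.cpp code:

// Sketch of the Whisper decoder pattern: the same weight tensor feeds
// both GET_ROWS and MUL_MAT, so both ops must be supported by the
// backend for the tensor to stay repacked there.
struct ggml_tensor * cur;

// embedding lookup at the start of the decoder graph
cur = ggml_get_rows(ctx, model.tok_emb, tokens);                      // GET_ROWS

// ... decoder layers ...

// output projection reuses the same embedding weights (tied embeddings)
struct ggml_tensor * logits = ggml_mul_mat(ctx, model.tok_emb, cur);  // MUL_MAT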

ggerganov (Member)

I'm still not sure why we need this check. What is the failure case that you observe? For example, if we have a non-Whisper LLM that calls ggml_get_rows with n_dims = 256, why would we not want to use this path?

chaxu01 (Collaborator, Author)

The failure case we observed occurs in whisper.cpp after PR #15118 was merged. Specifically, a GGML_ASSERT is triggered when the newly added GET_ROWS support check fails for the KleidiAI backend.

The main reason for enabling GET_ROWS support for Whisper in KleidiAI is that Whisper reuses the same weight tensor for both GET_ROWS and MUL_MAT in the decoder. If GET_ROWS isn't supported, it falls back to CPU — which in turn forces the shared tensor to fall back as well, disabling KleidiAI acceleration for MUL_MAT.

In contrast, LLaMA and similar models don’t reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.

The n_dims check is a lightweight way to detect Whisper models based on known decoder embedding sizes (384–1280). Happy to revisit or generalize this approach if other models benefit from the same pattern in the future.
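
For context, here is a minimal sketch of how such a heuristic would gate the backend's op-support check; the surrounding function and its exact signature are assumptions made for illustration, not the actual KleidiAI source:

// Hypothetical op-support hook: accept GET_ROWS only when the source
// tensor looks like a Whisper decoder embedding, so the shared weight
// stays on the KleidiAI backend for the subsequent MUL_MAT.
static bool supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return true; // matmul benefits from repacked weights
        case GGML_OP_GET_ROWS:
            return is_whisper_model(op);
        default:
            return false;
    }
}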

chaxu01 (Collaborator, Author)

@ggerganov, just wanted to follow up on this one.

The main goal here is to ensure Whisper models benefit from KleidiAI acceleration by keeping GET_ROWS + MUL_MAT on the backend, while avoiding unnecessary repacking for LLaMA and similar models. The n_dims check was meant as a lightweight heuristic for Whisper’s known decoder embedding sizes. Happy to adjust the approach if you prefer a different way of handling Whisper-specific GET_ROWS acceleration.

ggerganov (Member)

I see, I understand the root issue now: the same tensor is being used both for GET_ROWS and MUL_MAT. I don't think this is the right solution, though. We should not make such assumptions about the tensor shapes in the underlying implementation, as it is very fragile.

> In contrast, LLaMA and similar models don’t reuse weights across these ops, so enabling GET_ROWS support for them would introduce unnecessary repacking without any performance gain.

Do you have an estimate of how large the negative impact is in such cases?

In hindsight, the logic for repacking tensors used with GET_ROWS was probably premature. The repacking mechanism was designed with a single op in mind: if the single op that uses the tensor would benefit from repacking, then we allow repacking.

In contrast, for this use case, the condition to repack now also depends on the next ops in the graph. This is something we overlooked when implementing the GET_ROWS repacking support, and we probably have to take a step back and think of a better way to support these use cases.
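
To make that dependency concrete, a graph-aware repack condition would have to scan the other nodes, along these lines (a hypothetical helper, not existing ggml code):

// Only repack a weight used by GET_ROWS if some node in the graph also
// consumes that same weight via MUL_MAT, i.e. the decision depends on
// the rest of the graph, not just the single op at hand.
static bool weight_also_feeds_mul_mat(const struct ggml_cgraph * gf,
                                      const struct ggml_tensor * weight) {
    for (int i = 0; i < gf->n_nodes; i++) {
        const struct ggml_tensor * node = gf->nodes[i];
        if (node->op == GGML_OP_MUL_MAT && node->src[0] == weight) {
            return true;
        }
    }
    return false;
}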

chaxu01 (Collaborator, Author)

I agree that the shape-based check is fragile and not the right direction long term.

I have done some benchmarks. Since the repacking happens during model load, I wasn’t able to observe any negative impact on an M4 Pro when enabling GET_ROWS for LLaMA and similar models; even the model load time didn’t show any noticeable change in llama.cpp.

Given that, I think it’s better not to special-case whisper.cpp at all for now. I’ll update the patch accordingly unless you have other thoughts, and I’ll be happy to revisit this once the repack support is updated to better handle cases like Whisper’s shared tensor pattern.

./bin/llama-bench -m ./models/llama-2-7b-q4_0.gguf -ngl 0 -t 4

W/O GET_ROWS
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         85.01 ± 0.52 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         40.08 ± 0.92 |

build: d93dcc27 (6287)

W/ GET_ROWS
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           pp512 |         85.60 ± 0.25 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |           tg128 |         42.49 ± 0.63 |

build: d93dcc27 (6287)

ggerganov (Member)

> Given that, I think it’s better not to special-case whisper.cpp at all for now. I’ll update the patch accordingly unless you have other thoughts, and I’ll be happy to revisit this once the repack support is updated to better handle cases like Whisper’s shared tensor pattern.

Ok, let's avoid the special cases for now, since the impact does not seem to be significant. If we don't find cases in which the impact is measurable, then I think it would be OK to keep the GET_ROWS repacking as it is.

@chaxu01 chaxu01 merged commit 2b3efea into ggml-org:master Sep 11, 2025
48 checks passed
@chaxu01 chaxu01 deleted the feature/get-rows-assert branch October 7, 2025 14:36
